The Foundational Data Initiative for Parkinson Disease: Enabling efficient translation from genetic maps to mechanism

Summary The Foundational Data Initiative for Parkinson Disease (FOUNDIN-PD) is an international collaboration producing fundamental resources for Parkinson disease (PD). FOUNDIN-PD generated a multi-layered molecular dataset in a cohort of induced pluripotent stem cell (iPSC) lines differentiated to dopaminergic (DA) neurons, a major affected cell type in PD. The lines were derived from the Parkinson’s Progression Markers Initiative study, which included participants with PD carrying monogenic PD variants, variants with intermediate effects, and variants identified by genome-wide association studies and unaffected individuals. We generated genetic, epigenetic, regulatory, transcriptomic, and longitudinal cellular imaging data from iPSC-derived DA neurons to understand molecular relationships between disease-associated genetic variation and proximate molecular events. These data reveal that iPSC-derived DA neurons provide a valuable cellular context and foundational atlas for modeling PD genetic risk. We have integrated these data into a FOUNDIN-PD data browser as a resource for understanding the molecular pathogenesis of PD.


In brief
Bressan et al. report the first data release of the Foundational Data Initiative for Parkinson Disease (FOUNDIN-PD), which includes a multi-layered molecular dataset of 95 induced pluripotent stem cell (iPSC) lines with different genetic risk backgrounds differentiated to dopaminergic (DA) neurons, a major affected cell type in Parkinson disease (PD).

INTRODUCTION
Our understanding of the genetic architecture of Parkinson disease (PD) has expanded considerably over the last decade. Investigations of rare monogenic forms of PD and parkinsonism have revealed multiple genes that contain disease-causing mu-tations. 1 Additionally, iterative application of genome-wide association studies (GWASs) in increasingly larger sample sizes have identified 90 independent risk variants for PD, which cumulatively contribute to 16%-36% of the heritable risk for the disease. 2 One of the main pathological hallmarks of PD is the progressive degeneration of dopaminergic (DA) neurons in the substantia nigra and the accumulation of alpha-synuclein protein aggregates, known as Lewy bodies and Lewy neurites. 3 Additionally, previous work has highlighted that genetic risk in PD is likely to play a significant role in DA neurons. 2,4 On a clinical level, there is large variability in age at onset and progression across patients with both monogenic and idiopathic PD, even in those carrying the same damaging variant. This variation is likely caused by a combination of environmental and genetic factors, and while some environmental risk factors have been identified, such as smoking and exposure to pesticides, studying the environment remains complex. 5 For this reason, within this study, we have chosen to focus on genetic risk factors in the context of PD. Interestingly, several genetic risk factors for PD identified by GWASs also influence the overall risk in carriers of LRRK2 or GBA1 mutations, 6,7 which are the most common genetic causes of PD. In addition, multiple GWAS-nominated loci include genes implicated in monogenic forms of PD (e.g., SNCA and LRRK2), highlighting a clear etiologic link between monogenic and sporadic disease. Thus, understanding the molecular mechanisms underlying known genetic risk factors and mutations would provide actionable insights into the biology of disease risk, onset, progression, and modifiers of disease.
While the pace of genetic discovery has increased dramatically in recent years, our ability to characterize the associated function and dysfunction of nominated genes and risk loci has not matched this progress. Research centered on the biology of genes that contain rare disease-causing mutations has revealed important insights into the molecular mechanisms leading to disease; however, it is challenging to demonstrate how risk loci identified by GWASs may lead to disease. This is largely due to the complexity of these risk signals and the lack of largescale reference data to interpret the molecular outcomes at these risk loci. A significant issue arises when unraveling GWAS loci due to the complex architecture of the human genome, meaning that modifier and risk loci identified by GWASs generally nominate genomic regions and not specific genes. Adding to this complexity, disease effect sizes are modest, the cellular context is often unknown, and the genetic mediator is generally unlikely to be protein-coding. Extensive experimental work has provided clear insights into the molecular consequences of these variants but has not yet shown the influence of additional risk factors on these molecular disturbances, which is essential to understand why some carriers of these risk factors develop disease and others do not.
Studying the biology of GWAS loci in traditional cellular and animal models is extremely challenging due to large linkage disequilibrium (LD) blocks resulting in many highly correlated variants. Additionally, variants identified by GWASs are generally non-coding, and correlating these variants to a causative gene is difficult. Low effect sizes and uncertainty in the resulting phenotype further confounds the identification of adequate models. Therefore, the large number of known and to-be-discovered risk loci require an alternative strategy to understand the underlying biology. The development of human induced pluripotent stem cell (iPSC)-based cellular models provides a unique opportunity to address the collective impact of genetic risk factors and define the relevant cellular context for modeling these variants at scale. It is important to note that iPSC models are unlikely to be
PPMI lines were differentiated in five batches (ranging from 10 to 30 cell lines per batch) until day 25 or 65, followed by harvesting the cells for ICC and molecular assays. Quantification of MAP2+ and TH+ cells revealed that, on average, 80%   and S2), showing a higher variability in differentiation efficiency than the in-house iPSC lines used in protocol optimization. The average proportion of TH+ cells in the iPSC lines, relative to all cells in the culture, was similar when assessed by ICC with two independent TH antibodies, and the estimate of the proportion of MAP2+ cells, relative to all cells, was also independent of the MAP2 antibody used ( Figure S4A). To measure how robust and reproducible the differentiation protocol was using our automated system, we included a control line in each batch as a technical replicate (n = 5). The percentage of MAP2+ and TH+ cells obtained from the control cell line using ICC across all five differentiation batches was consistent ( Figure S4B), and no significant differences in the percentage of neurons or MAP2+ and TH+ neurons between batches were identified (p > 0.2 for both).
Quantifying gene expression in FOUNDIN-PD data using RNA sequencing To further characterize the iPSC-derived neurons, we generated a wealth of data types including genetic, epigenetic, regulatory, transcriptomic, and cellular imaging data ( Figure 1). To fully characterize the identity of the cell types generated by the iPSC differentiation protocol used in the present study, we performed single-cell RNA sequencing (scRNA-seq) on the majority of the day-65 cell lines (n = 79 with 4 control replicates, 84% of total included iPSCs). In total, 416,216 high-quality cells were retained, with an average of 5,015 cells per sample 13 (range: 584 to 9,640). Cells were first clustered using an unsupervised method (Louvain algorithm) and then annotated based on canonical cell-type markers found in the differentially expressed genes of the cluster (Table S2; Figure S5).  (5,687 total, 1% of total) ( Figure 2C). Overall, expression of TH, MAP2, and SNCA was clearly enriched in the neuronal cell types ( Figure 2D). Next, we assessed how similar the expression signatures are of the cultured DA neurons vs. human tissue DA neurons and also how our cultured DA neurons compare with previously published DA neuron datasets. We compared our identified cell type populations with public datasets from human postmortem substantia nigra 12 and human iPSC-derived DA neuron subtypes using a slightly modified DA neuron differentiation protocol and a distinct set of iPSC cell lines. 13 The DA neuron population identified in our data showed the highest correlation (Spearman's R = 0.69) with the TH+ neuron cluster found in human postmortem substantia nigra ( Figure 2E). This correlation was also identified using dendrogram clustering of DA neurons from this study and the TH+ neurons from human postmortem substantia nigra ( Figures S6A and S6B). The second highest correlation was observed between our immature DA neurons and the TH+ neuronal cluster from Agarwal et al. 12 (Spearman's R = 0.67; Figures S6A and S6B). Additionally, both immature and DA neurons were highly correlated with the iPSC-derived DA neuron subtypes (DAn1-4) identified by Fernandes and collaborators 13 ( Figure S6C), which were produced using a similar iPSC-to-DA neuron differentiation protocol. Another similarity detected between both iPSC-derived neuron datasets was the expression of serotoninergic markers in our immature DA neurons (FOUNDIN-PD; Table S2) and the previously published DAn2. 13 To validate the neuronal cell types identified by scRNA, we compared ICC-based estimates of DA neurons (TH+ cells) and overall neurons (MAP2+ cells) with the percentage of positive cells obtained from scRNA-seq data. We found high correlations between the ICC and scRNA-seq data (Pearson correlation of R = 0.8562, p < 0.0001, and R = 0.8916, p < 0.0001, for TH [Pel-Freeze] and MAP2, respectively; Figures 2F and 2G). Similar results were obtained with a second TH (Millipore) antibody (Figure S7). Although the differentiation efficiency (percentage of each cell type) varied between cell lines ( Figure 2H), no consistent cell-type enrichment could be identified based on batch, phenotype, recruitment category, genetic sex, or PD-linked genotype (GBA1+, LRRK2+, SNCA+) ( Figure S8). Additionally, a very high correlation was observed (R > 0.9) between technical replicates (n = 4) using gene-level scRNA-seq data of the identified DA neuron cluster ( Figure S9) and total TH and MAP2 levels (C) Uniform manifold approximation and projection (UMAP) illustrates cell clusters identified at day 65 (n = 416,216 single cells, n = 79 + 4 control replicate cell lines). Cell types with their respective percentages are indicated. (D) Percentage of cells and average expression level of TH, MAP2, and SNCA for each cell type. The dot color scale from blue to red corresponds to lower and higher expression, respectively. The size of the dot is directly proportional to the percentage of cells expressing the gene in a given cell type. PFPP, proliferating floor plate progenitors; Prog, progenitors. (E) Spearman's correlation test showing high correlation of gene expression across FOUNDIN-PD DA neuronal types and postmortem substantia nigra human brain. ODCs, oligodendrocytes; OPCs, oligodendrocyte precursor cells. See Figure S6A for UMAP of cell types identified by using Agarwal and collaborators' data. 12 (F and G) Correlation between percentages of TH+ (Pel-Freez) and MAP2+ cells in ICC and scRNA-seq (R, Pearson correlation coefficient; p < 0.0001). Each dot represents one cell line (n = 83).  Figure S4), suggesting that, while there is variability in differentiation efficiency across the lines, this is likely not caused by the differentiation protocol but may be due to inherent characteristics of each individual line.
To assess gene expression differences across multiple time points during differentiation, we generated bulk RNA-seq at days 0 (n = 99), 25 (n = 98), and 65 (n = 96), with each time point including five technical replicates of the control line. A principalcomponent analysis (PCA) of bulk RNA-seq showed clear clustering by time point ( Figure 3A). In accordance with the scRNA-seq data, we also observed a very high correlation (R > 0.9) between the technical replicates in the gene-level expression of each time point in bulk RNA-seq data (Figures S10A-S10C). Further evaluation of the bulk RNA-seq data across all time points showed clear transcriptional enrichment signatures that correlated with neuron-like features, including synapse assembly, neurotransmitter transport, and action potential (Table S4) on days 25 and 65. Additionally, specific genes of interest, such as MAP2 and TH ( Figures 3B and 3C), and GBA1, SNCA, LRRK2, and SYN1 (Figures S11A-S11D) showed increased expression levels as cells were differentiated. Concurrently, iPSC-associated genes such as POU5F1 ( Figure 3D), NANOG, and TDGF1 (Figures S11E and S11F) showed significantly reduced expression at later time points relative to day 0, which correlated with a decrease in pathway signatures of iPSC differentiation and growth, including somatic stem cell population maintenance and positive regulation of cell population proliferation (Table S5).
Next, we used the day-0 bulk RNA-seq gene expression data to predict DA neuronal differentiation efficiency. We defined DA neuronal differentiation efficiency as the fraction of DA neurons in our scRNA-seq datasets at day 65 using a method similar to that described by Jerber and collaborators. 15 Using logistic regression, ten genes were identified that had a non-zero coefficient and predicted good neuronal differentiation efficiency with an area under the curve (AUC) of 0.93 and 0.87 accuracy (95% confidence interval [0.78, 0.93]) (Figures S12A-S12C). Repeated 5-fold cross-validation achieved a mean AUC of 0.85 with 0.03 SD. Out of these ten genes with a non-zero coefficient, five were significantly correlated with neuronal differentiation efficiency (false discovery rate [FDR] < 1%; Figures 3E-3I and S12D). Three (HNRNPH3, SRSF5, and HSD17B6) of these associated genes were positively correlated with neuronal differentiation efficiency. Moreover, the expression of these genes was significantly reduced as iPSCs were differentiated to DA neurons (adjusted p < 0.05 from day 0 to 65; Figure 3J), suggesting that their high expression in iPSCs may represent an increased differentiation potential. Previous studies have shown SRSF5 is associated with neuronal differentiation efficiency (R = 0.25, adjusted p = 0.013) 15 and that SRSF5 binds to pluripotency-specific transcripts and positively correlates with the cytoplasmic mRNA levels of pluripotency-specific factors. 17 Interestingly, HNRNPH3 is also a known RNA-binding protein, suggesting that regulation of RNA binding may be an important pathway for neuronal differentiation. The remaining two associated genes (ZSWIM8 and ARSA) were negatively correlated with neuronal differentiation efficiency, and their overall expression was significantly increased during differentiation (adjusted p < 0.05 from day 0 to 65; Figure 3J).
Establishing regulatory maps of iPSC-derived DA neurons To identify epigenetic and regulatory features of genes in iPSCs and differentiated DA neurons, we generated DNA methylation, assay for transposase-accessible chromatin using sequencing (ATAC-seq; both bulk and single-cell), HiC sequencing and small RNA-seq data across several time points. DNA methylation data from bulk cultures were generated at days 0 (n = 97 after quality control [QC], including five technical replicates) and 65 (n = 82 after QC, including three technical replicates). These data were generated to assess changes in epigenetic patterns that potentially regulate gene transcription. The methylation data showed clear separation between both time points ( Figure S13). Additionally, marker genes such as MAP2 and TH showed a significant reduction in methylation from iPSCs at day 0 to DA neurons at day 65 (Figures S14A-S14E).
ATAC-seq is a commonly used technique to assess genomewide chromatin accessibility. Bulk ATAC-seq was generated from cultures at days 0 (n = 99), 25 (n = 97), and 65 (n = 94),  Figure S15A). Interestingly, we observed an increase in evolutionary sequence conservation at merged peak sets in more differentiated cells, where the lowest PhastCons score 19 across all peak sets was at day 0 and the highest at day 65 ( Figure S15B).
To provide a cell-type-specific view of chromatin accessibility in our differentiated cells, we generated single-cell ATAC-seq (scATAC-seq) at day 65 for a subset of the samples (n = 27 + 2 replicates). Following quality control, 139,659 cells were retained, with an average of 4,816 cells per sample (range: 944 to 11,649). We identified similar broad cell types as in the scRNA-seq data ( Figure 4B). However, the percentage of immature DA neurons and progenitor cell types was different between the scRNA-seq and scATAC-seq datasets. Cell-type-specific chromatin accessibility was observed at particular genes of interest. For example, a distinct peak was identified at the promoter of TH in bulk ATAC-seq at days 25 and 65 that, when examined in scATAC-seq, only appeared in the DA neuron cluster ( Figure 4C). Overall, ATAC-seq reads were enriched at the promoters of expressed genes, but it is important to note that not all genes in this region had peaks at their promoters in either bulk or scATAC-seq, reflecting cell-type specificity. ATAC-seq is also known to identify cell-type-specific intergenic regulatory regions. Reflecting this, we observed peaks at putative regulatory regions upstream of TH that were restricted to the progenitor and DA neuron populations, suggesting that these sequences may play a role in priming TH expression. A peak identified at the promoter of MAP2 in bulk ATAC-seq at days 25 and 65 also appeared as a broader neuronal marker in all cell types identified in scATAC-seq, except for the neuroepithelial-like cells, which are a non-neuronal cell type ( Figure S16).
HiC-seq is a method used to identify three-dimensional chromosome interactions (chromosome loops). These loops are known to be involved in regulating gene transcription by enabling physical interactions of enhancers with their cognate promoters. 20,21 These data can be particularly useful for linking distal risk loci/variants with regulatory regions and genes. HiC-seq data were generated for a subset of batch 1 at days 0 (n = 9) and 65 (n = 8) due to the large number of cells required as input for this assay. The HiC chromosome loops showed clear separation of both time points, and marker gene MAP2 showed distinct differences in HiC loop patterns (Figures S17 and S18). To complement the other gene expression and regulatory data, we performed small RNA-seq to investigate various classes of small RNAs, including microRNAs (miRNAs; Piwi-interacting RNAs [piRNAs], tRNA fragments) and other non-coding RNAs less than 50 bp, which are often involved in gene silencing and posttranscriptional regulation of gene expression. Small RNA-seq was generated at days 0 (n = 99), 25 (n = 98), and 65 (n = 96), with each time point including the control line with five technical replicates. Separation was seen between all time points ( Figure S19). The miRNAs that were significantly upregulated between day 0 and 65 are enriched for those that are highly expressed in tissues from the CNS when examined across 34 different tissues 22 ( Figure S20).

Longitudinal imaging of iPSC-derived DA neurons
To assess the relationship between the various molecular readouts described above and neuronal phenotypes, we performed longitudinal imaging and single-cell analysis. Cell-based imaging can be a valuable complementary approach to molecular analyses for characterizing phenotypes. To perform longitudinal single-cell analysis, 10 out of 95 iPSC lines differentiated into DA neurons (batch 1) were frozen on day 25 of differentiation. Frozen neurons were thawed, plated in 96-well dishes, matured for an additional 25 days, and transduced with a lentivirus to express GFP under the control of a synapsin I (SYN1) promoter on day 50. To focus our analysis on the subpopulations of cells perceived to be most relevant to PD, we expressed GFP from a SYN1 promoter to restrict marker gene expression to relatively mature neurons. Fluorescence became visible within a day of transfection, and robotic microscopy 23 was used to image cells every 24 h for approximately 10 days. Cells exhibiting GFP fluorescence had the characteristic morphological features of relatively mature DA neurons ( Figure 5A). The GFP morphology signal was used to unambiguously identify individual neurons and to track each cell from one imaging time point until the next. Because of its ability to track individual cells, robotic microscopy can monitor whether and how phenotypes change over time and obtain a cumulative measure of phenotypic endpoints that better controls for variability and increases sensitivity of comparisons of phenotypes between cohorts. Live neurons could be followed throughout the duration of the experiment. Representative neuron survival over 6 days is shown in Figure 5A. Cell death was detected as an abrupt loss of signal, indicative of a loss of membrane integrity ( Figure 5A). In total, 2,992 cells were analyzed across the 10 lines. The time required for complete loss of signal (time of death) from hundreds of neurons was analyzed with the Kaplan-Meier survival model, 24 and cumulative risk of death curves were generated ( Figure 5B). Comparison of lines from individuals harboring GBA1 mutations compared with HC lines demonstrates a significantly increased risk of death in GBA1 lines ( Figure 5C).
Testing the contextual fit of iPSC-derived DA neurons for modeling PD-related genetic risk We identified a wide genetic risk spectrum across the iPSC lines that we studied (Table 1; Figure S1). In addition to the contribution of genetic risk from known damaging variants in GBA1, LRRK2, and SNCA, there is a substantial common risk element that can be quantified by polygenic risk score, as previously shown using GWASs. 2 One limitation of GWASs is that they often cannot identify the causal variants, genes, or relevant cell type for each locus without additional gene expression or functional data. A method commonly used to infer cell-type relevance based on GWAS statistics is multi-marker analysis of genomic annotation (MAGMA). This method relies on the convergence of unbiased genetic risk maps with single-cell expression data; Cell Genomics 3, 100261, March 8, 2023 9 Resource ll OPEN ACCESS the enrichment of expression of genes from risk loci in individual cell types acts as a powerful indicator of cell-type relevance. 25 Previous analysis using mouse and human brain expression data has shown that DA neurons are a critical cellular context for PD-related genetic risk. 2,4 Analysis of the scRNA-seq expression data from this study revealed a dramatic enrichment of expression of genes within PD-linked risk loci in the two identified DA cell types (immature and DA neurons) relative to the other cell types ( Figure 6A; Table S6). Combined with the comparisons detailed above, these data reveal that this model resembles human brain neurons and provides a cellular context that is appropriate for modeling complex genetic risk in PD.
In an effort to nominate potential causal genes and molecular mechanisms tagged by each GWAS locus, we combined wholegenome sequencing with our scRNA-seq data in differentiated cells to identify expression quantitative trait loci (eQTLs) in each broadly defined cell type. Using this approach, we replicated known eQTLs in the KANSL1 and LRRC37A region reflecting the H1/H2 MAPT haplotypes ( Figures S21A-S21D). When exploring the eQTL results further, we specifically focused on the 90 risk variants from the most recent GWASs in PD. 2 Multiple variants in this dataset showed significant eQTL associations in at least one of the defined cell types in our DA neuron scRNA-seq (Table S7). For example, the locus with rs11950533 as the lead variant harbors at least 25 genes ( Figure 6B), and based on the PD GWAS browser prioritization tool, 26 four (CAMLG, JADE2, TXNDC15, and SAR1B) were prioritized based on their high correlation between cortical brain eQTL data 27 and PD GWAS signal ( Figures S22A-S22D). In the current FOUNDIN-PD scRNA-seq expression data, an eQTL for CAMLG was identified ( Figure 6C), which shows high correlation with the PD GWAS signal (Figure 6D). However, no eQTL signals were identified for JADE2, SAR1B, or TXNDC15 ( Figures S23A-S23C), despite all genes being expressed in our DA neurons ( Figure S24). Inspection of the CAMLG bulk RNA-seq eQTL signal and the PD risk signal intersection revealed that this eQTL was not detected at the iPSC state at day 0 but became detectable at day 65 ( Figure S25). This suggests that the regulatory effect signal or trajectory of CAMLG expression may correspond with differentiation to DA neurons. Therefore, based on FOUNDIN-PD data, CAMLG should be prioritized further as a candidate for functional follow up to confirm the association between CAMLG and PD risk.
PD risk and scRNA eQTL signals for DA neurons also intersected with other independent PD risk loci including TBC1D5, Resource ll OPEN ACCESS PRCP, CCAR2, ARIH2, and CCDC58. In these genes, the PD risk variant appears to be more statistically significantly associated with expression in DA neurons when compared with other cell types detected in the FOUNDIN-PD resource. This intersection of disease risk and DA neuron expression effect appears to be specific to DA neurons for TBC1D5 ( Figure S26), CCAR2 ( Figure 7B), and ARIH2 ( Figure S27), whereas PRCP and CCDC58 (Figures S28 and S29) and CAMLG ( Figure S30) showed signals that intersected PD risk signals in multiple cell types. Inspection of additional FOUNDIN-PD resources such as the single-cell ATAC peaks revealed instances of peaks that were elevated in DA neurons compared with other cell types and that contained ( Figures 7C and 7D) variants that were in high LD with the PD risk index variants at the locus. For TBC1D5, CCAR2, and ARIH2, the signal intersection was more specific to DA neurons, but, when inspecting the bulk RNA (Figure 7E) data, the signal was not present.

The FOUNDIN-PD data portal
To allow rapid and easy data access to researchers, gene-and region-level views of data are available through a web-based portal (https://www.foundinpd.org) integrating the multi-omics data types (Figure 1). All users can access summary-level data for a region (<5 Mb) or gene by registering with a single sign-in Google account. A single-integrated view allows for visualization of genomic data by genomic coordinates with tracks available for scRNA-seq, scATAC-seq, bulk RNA-seq, bulk ATAC-seq, methylation arrays, HiC-seq, and small RNA-seq, among others. The portal is interactive, allowing the dynamic ability to view facets/partitions of data split by LRRK2/GBA1/SNCA status, sex, and diagnosis. The tracks are responsive for dynamic zooming and panning by touch or mouse and can be reordered or hidden from views. Users can download data backing the graphs via CSVs and export screen snapshots. Users who are authenticated for access to individual-level data via https://www. ppmi-info.org/ will also have the ability to visualize individuallevel data. Additional phenotypic detail is available, and users can, for example, dynamically plot expression versus SNP genotype or many other variables available on subjects. The portal contains links to several additional access points, including PPMI-INFO for individual-level data and a GitHub site (https:// github.com/FOUNDINPD) with analysis and standard operating procedures (SOPs). Finally, a specific single-cell view of the data is available via an embedded cellxgene instance, 29 providing uniform manifold approximation and projection (UMAP) and PCA views. Through this interface, users can view identified clusters, genes, and gene families across profiled cells.

DISCUSSION
Genetic understanding of disease is the first step on the path from biological insight and target identification to the development of mechanistic-based treatment. However, in order to translate genetics to biology, we require an ability to model the influence of genetic risk in a contextually appropriate system and to generate replicable disease-relevant readouts. The rapidly growing number of genetic risk variants and mutations  15 and are mainly attributed to cell-intrinsic factors maintained over multiple freeze-thaw cycles. One of the factors driving inefficient differentiation toward specific cell types seems to be the heterogeneity of endogenous WNT signaling between iPSC lines, 33 meaning that efficient patterning to DA neurons is dependent on the concentration of the GSK3 inhibitor/WNT activator (CHIR99021) and would need to be optimized for each iPSC line. 34,35 However, performing such optimization for all FOUNDIN iPSC lines would have been costly and time-consuming. To minimize such variability between iPSC cell lines, researchers developed strategies such as the selection of well-characterized cell lines for specific applications 36,37 and large-scale collaborative projects. 38 However, such strategies can only be applied to projects developed with a small number of iPSC lines.
The inclusion of single-cell methods, which emerged into general usage during the execution of this study, has clearly been of great benefit to FOUNDIN-PD. These data help overcome the cell-type heterogeneity of differentiated ''mixed'' cultures, provide a cellular context for genetic risk, and also have the capacity to reveal transcriptomic and regulatory features specific to the disease-relevant cellular context. Therefore, the role of bulk RNA-seq indeed has changed. The bulk RNA-seq is now complementary to the scRNA-seq data and provides certain benefits that the single-cell data cannot, including the much higher sequencing depth, which therefore allows investigating of splicing events and isoforms, which can be combined with deconvolution analysis. Overall, based on our observations thus far, the expansion of these methods to include single-cell transcriptomics combined with ATAC-seq, single-cell HiC, singlecell chromatin immunoprecipitation methods to reveal transcriptional factor targets, and single-cell proteomics will add more resolution to the FOUNDIN-PD study and more disease-relevant insights. Inclusion of these single-cell data will be a key part of the next stage of FOUNDIN-PD.
Finally, we believe that longitudinal imaging of intact cells can complement the molecular analyses and add significantly to the characterization of patient-derived iPSC lines and to our goal to conduct functional genomics for PD. 39 FOUNDIN-PD includes an extensive set of molecular analyses, but we recognize that some potentially important classes of bioactive molecules (e.g., lipids, metabolites) and functions (e.g., electrical activity) were not measured. For some assays, important subcellular spatial relationships of the macromolecules are necessarily lost during sample preparation. Imaging provides a method of studying cells as intact living systems, preserving critical components and their spatial relationships in situ, and enabling functional measurements relevant to PD that would be difficult to infer from reductionist molecular analyses. As noted above, there are inherent challenges associated with understanding how genetic variants implicated in PD contribute to disease. The effect size of individual variants is often small, making functional effects hard to detect, and it may be the case that substantial disease risk for an individual is conferred through the combined nonlinear effects of multiple variants. If so, combining imaging with molecular analyses may be particularly helpful because it provides an approach to study the integrated effect of genetic variants on specific cell functions relevant to disease. Finally, imaging data are especially amenable to powerful machinelearning types of analyses, which can be used to discover biological insights from images that elude the human eye 40 and provide a computational framework for integrating other data types, including types of multi-omics data produced by FOUNDIN-PD. Indeed, next steps include multi-omics data integration to systematically understand and identify PD-relevant pathways. This integration provides an opportunity to investigate PD-relevant biological pathways at multiple layers like the genotype, chromatin, and transcript levels.

Limitations of the study
The efficiency and reproducibility of the DA neuron differentiation protocol was not previously explored on the large set of iPSC lines. Here we identified, as expected, that there is substantial line-to-line variation. Interestingly, we were able to identify early expression markers that correlate with the potential to generate high levels of DA neurons in the FOUNDIN-PD cell lines; therefore, it is tempting to speculate that sorting iPSCs based on a high expression of, for example, SRSF5 may improve differentiation efficiency. These results are in line with a previous report showing that expression markers detected at the iPSC stage can robustly predict differentiation capability. 15 While the emergence of single-cell molecular methods relieves some concerns regarding cellular heterogeneity, improving differentiation consistency line to line would be of benefit. We acknowledge that this dataset is underpowered to reveal all but the strongest of mechanisms associated with complex disease risk loci. Additionally, while iPSCs are a useful model, they have limitations, including the fact that DNA methylation signatures from the donor are not preserved upon reprogramming, and, therefore, they lack aging-related phenotypes, which is the biggest common risk factor for PD. While the number of lines required to generate insights at the remaining loci will vary from risk allele to risk allele, we believe that the next stage of FOUNDIN-PD should include a significant increase in scale. Notably, as initiatives such as the Global Parkinson's Genetics Program (GP2) focus on diversifying the ancestral basis of our genetic understanding in PD, 41 efforts such as FOUNDIN-PD should also prioritize the generation of reference data in well-powered ancestrally diverse systems. We also see the value in diversifying our terminal differentiation target to include other cell types potentially relevant for PD.

Conclusions
Overall, we present here the first data release of the FOUNDIN-PD project, which includes multi-omics and imaging data on iPSCs differentiated to DA neurons of 95 PPMI participants harboring a range of genetic risks from fully penetrant causal mutations to carriers of combinations of risk alleles identified by GWASs. We believe the FOUNDIN-PD data will serve as a foundational resource for PD research with easily accessible data and browsers designed for basic scientists. This dataset will help the community to better understand the mechanisms of PD, identify new disease-relevant targets, and potentially impact the development of novel therapeutic strategies.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Sample selection and batching
Upon receiving, all cell lines were NeuroChip array genotyped 62 to confirm sample origin and to assess if large genomic events occurred during reprogramming, iPSC culture and differentiation. The data were compared to donor (blood derived) whole-genome sequencing (WGS) to identify large genomic abnormalities. Of 134 subjects, 80 are males and 54 females. The cell line collection included healthy controls, PD cases without mutations in PD-related genes, and affected and unaffected individuals harboring damaging point mutations including SNCA p.A53T, LRRK2 p.G2019S, LRRK2 p.R1441G, GBA1 p.E326K, GBA1 p.T369M and GBA1 p.N370S. Note that one iPSC line carries both LRRK2 p.G2019S and GBA1 p.N370S and another iPSC line carries both LRRK2 p.G2019S and GBA1 p.T369M. Given the much larger effect size on PD risk of LRRK2 p.G2019S, these lines were annotated as LRRK2+, but with a comment that they also carry a GBA1 mutation. From the 134 cell lines, 95 passed QC and were selected for DA neuron differentiation and split into five batches (Table 1). One control cell line was included in each batch as a technical replicate (n = 5) totaling 99 samples (Table S1).

Differentiation of iPSC into dopaminergic (DA) neurons
The PPMI iPSC lines were thawed and grown on matrigel (Corning)-coated plates with Essential 8 Flex (E8, Batches 1, 2 and 3) or Essential 6 (E6, Batches 4 and 5) media (both Gibco) for about one month (5 passages). Essential 6 medium was supplemented with growth factors to become equivalent in composition to Essential 8. Upon reaching 70-80% confluency, iPSC lines were dissociated into a single-cell suspension with Accutase (Gibco) and plated at 200,000 cells/cm 2 on matrigel-coated one-well plates (barcoded, Greiner) suitable for automated cell culture. Cells were grown until they covered the plate surface, usually 24-48 h after single-cell plating. The time required to reach confluence was variable and dependent on the growth rate of each iPSC line. The DA differentiation protocol was adapted from Kriks and collaborators 9 with minor modifications. 10 Differentiations were carried out in an automated cell culture system 11 with manual replatings on days 25 and 32 for final differentiation and immunocytochemistry (ICC), respectively. 11 Samples for assays were collected on days 0 (iPSC), 25 (mid-point) and 65 (DA neurons). For DNA assays, cells were dissociated with Accutase, washed once with PBS and spun down at 200 g. The cell pellet was snap-frozen or processed according to assay protocols. Most of the RNA assays required snap-frozen cells collected by scraping the plate surface with PBS or lysis buffer. Single-cell (sc) RNA-seq and scATAC-seq assays required a single cell suspension prepared in 0.04% human serum albumin (HSA)/PBS. All samples were stored at À80 C until further processing. For cryopreservation, day-25 DA neuron precursors were dissociated with Accutase, washed once with neurobasal medium (Gibco), resuspended in cold Synth-a-Freeze cryopreservation medium (Gibco) supplemented with 10 mM Y-27632 and aliquoted into barcoded cryovials (NovaStora) at 10x10 6 cells/ml/vial (on ice). The cryovials were placed in CoolCell cell freezing containers (Biocision), kept overnight at À80 C and transferred to liquid nitrogen for long term storage.

Immunocytochemistry (ICC) and image analysis
Cells were differentiated until day 65, fixed in 4% PFA, washed 3 3 5 min in PBS and blocked in 5% goat serum/1% BSA/0.1% Triton X-100/PBS for 1 h at room temperature (RT). Primary antibodies were applied overnight at 4 C and included TH (Pel-Freez Biologicals #P40101 and Merck Millipore #AB9702, both at 1:750 dilution), MAP2 (Santa Cruz #sc-74421, 1:100) and TUJ1 (R&D #MAB1195, 1:500). After incubation with primary antibodies, cells were washed 3 3 5 min in PBS. Cells were incubated with second antibodies (AF488 and AF594, Invitrogen, 1:1000) for 2 h at RT followed by nuclear counterstaining with Hoechst 33,342 (Invitrogen, 1:8000) for 30 min at RT. Finally, cells were washed 3 3 5 min in PBS and imaged with a CellVoyager 70003(Yokogawa) confocal microscope and 203 objective. Images were analyzed on Columbus (PerkinElmer) as described previously. 11 The total number of TH (DA neuron) and MAP2 or TUJ1 (neuron) positive cells was estimated and normalized to the total number of nuclei. Data is represented as the percentage of positive cells per 30 fields.

Longitudinal image analysis of iPSC DA neurons
Frozen day-25 DA neuron precursors were thawed and replated at a density of approximately 450,000 cells/cm 2 onto dishes coated with 0.1 mg/ml poly-l-ornithine (PLO), 5 mg/ml laminin, and 5 mg/ml fibronectin in NB/B27 medium prepared as described 10 with the addition of 10mM ROCKi and 100 mg/ml matrigel (Corning). The media was changed 4 h later to remove ROCKi. DA neurons were matured in NB/B27 medium, then replated into 96-well plates on day 49. On day 50, cells were transduced with synapsin-driven GFP via lentivirus (SignaGen), followed by a media change the next day. Cells were imaged daily from approximately day 54 through 66 using robotic microscopy, a previously described automated imaging platform. 23,24 Images obtained from 8 consecutive days were processed using custom programs in Galaxy 63,64 to assemble arrays of images into montages representing each well, and to stack montages across timepoints. Neuron survival was analyzed using a custom program written in MATLAB. Live neurons expressing GFP were selected for analysis only if they had extended processes at the first timepoint. Neurons were tracked longitudinally across timepoints until death, and survival time was defined as the last timepoint a neuron was seen alive. The survival package for R statistical software was used to construct Kaplan-Meier curves from the survival data. 24 Survival functions were fitted to these curves to derive cumulative risk-of-death curves. Statistical differences between groups were analyzed using the Cox mixed-effects model.

scRNA-seq
The BCL files obtained after sequencing were demultiplexed into FASTQ files using the cellranger ''mkfastq'' software and unique molecular identifier (UMI) gene counts were calculated by cellranger ''count'' software (v3.1.0). 69 UMI gene counts for each sample were merged into a table and imported into R (v3.6.0). We used Seurat (v3.1.1) 70 within the R environment for filtering, normalisation, integration of multiple single-cell samples, unsupervised clustering, visualisation and differential expression analyses. The following data processing was done: (1) Filtering. Cells with less than 1,000 and more than 9,000 genes expressed (R1 count) were filtered out, and only genes that were expressed in at least 100 cells were kept. Moreover, cells with more than 20% of counts in mitochondrial genes were filtered out. After filtering, there were 34,960 genes in 416,216 cells; (2) Data normalisation and integration. Gene UMI counts for each cell were normalised using the ''SCTransform'' function in Seurat. Integration of scRNA-seq data from multiple samples was performed using top 3000 variable features and top 3 samples as reference with the highest number of cells; (3) Clustering and visualisation. Clustering was performed using ''FindClusters'' function with default parameters except resolution was set to 0.5 and first 30 PCA dimensions were used in the construction of the shared-nearest neighbor (SNN) graph and to generate 2-dimensional embeddings for data visualisation using UMAP; (4) Differential expression analyses: We used ''FindAllMarkers'' function with default parameters and only tested genes that are detected in a minimum of 40% of cells in either of the two clusters. Genes with an adjusted p value < 0.05 were considered to be differentially expressed. The pipelines used in this study are available at https:// github.com/FOUNDINPD/FOUNDIN_scRNA.

scATAC-seq
The BCL files obtained after sequencing were demultiplexed into FASTQ files using the cellranger-atac ''mkfastq'' software and unique molecular identifier (UMI) counts were calculated by cellranger-atac ''count'' software (v1.2.0). 54 Peaks for each sample were merged into a table and imported into R (v3.6.0). We used Seurat (v3.2.0), Signac (v1.1.0) 53 and Harmony (v1.0) 71 within the R environment for filtering, normalisation, integration of multiple single-cell samples, unsupervised clustering, visualisation and predicting the cell types. The following data processing was done: (1) Filtering. We kept the cells with minimum 1,000 peaks (R1 count), respectively and the peaks that were called in at least 100 cells. Moreover, cells with more than 20% of counts in mtDNA were filtered out. After filtering, there were 459,495 peaks in 139,659 cells; (2) Data normalisation and integration. Peak counts for each cell were normalised using the ''RUNTFIDF'' function in Signac that performs term frequency-inverse document frequency normalisation followed by SVD decomposition to generate latent semantic indexing (LSI). Integration of scATAC-seq data from multiple samples was performed using the ''RUNHarmony'' function with LSI reduction; (3) Clustering and visualisation. Clustering was performed using the ''FindClusters'' function with default parameters except resolution was set to 0.1 or 0.2. First 30 harmony dimensions were used to generate 2-dimensional embeddings for data visualisation using UMAP; (4) Predicting cell types: Fragments in the genes (extended 2kb upstream) were calculated for each cell to generate a gene activity matrix and normalised the data using the ''LogNormalize'' method. Cell types were predicted using scRNA-seq data as a reference and scATAC-seq data as a query for ''FindTransferAnchors'' and ''TransferData'' functions. Prediction often results in heterogeneous cell type annotation with-in the same cluster. We assigned the cell type to a cluster with the maximum occurrence. The neuroepithelial-like cluster was separated using 0.2 resolution. The pipelines used in this study are available at https://github.com/FOUNDINPD/FOUNDIN_scATAC.
Prediction of neuronal differentiation efficiency using bulk RNA-seq data at day 0 To test the predictive value of the genic expression profile in iPSC for neuronal differentiation efficiency, we performed supervised machine learning (logistic regression) on the DESeq2 v1.26.0 49 normalised count expression values for genes at day 0 and estimated DA neurons fractions from the differentiated cell lines at day 65. DA neuron fractions were calculated from scRNA-seq data, based on the total number of cells and the number of cells in the 'Dopaminergic Neurons' cluster (see the STAR Methods section for scRNAseq). Cell lines were classified into high (n = 62) and low (n = 21) differentiation efficiency classes based on the relative abundance of the DA neurons at day 65; as a threshold for classification, we used first quartile value of cell percentages (Q25 = 15.7%), as it was best separating the two observed distribution peaks of DA neuron counts across the cell lines.
To reduce possible bias in the predictive model, we used a full set of reliably expressed genes (threshold of inclusion mean normalised count R50). As we did expect a significant number of genes to be highly correlated with one another in their expression, with multitude of them being possibly relevant for prediction, and the total number of relevant features for our model is unknown, we resolved to using elastic net regularisation approach, which combines both lasso regression (shrinking less important features and pruning some) and ridge regression (assigns proportional coefficients to highly correlated possibly relevant features and prevents model overfitting) to equal degree (alpha = 0.5) with a penalty lambda equal to 0.22. To further control for possible overfitting, repeated (100 times) 5-fold cross-validation was performed using the ''cv.glmnet'' function. Data preprocessing and logistic regression was executed in R (v3.6.3), 72 using packages caret (v6.0-86) 55 for model training and glmnet (v4.0) 56 for elastic net regularisation of the model and repeated cross-validation. As the sample size is small and imbalanced, we directly tested the relation of the resulting predictive candidate genes' expression to the percentage of DA neurons in each cell line. We performed Spearman's rank correlation test, using R package stats (v3.6.3). 72 Benjamini & Hochberg procedure was used for multiple testing corrections of p value.
MAGMA to identify causative cell types Expression gene profiles obtained from the scRNA-seq dataset were used to test for a cell type association with PD. We used the R package MAGMA_Celltyping (v1.0.0, https://github.com/NathanSkene/MAGMA_Celltyping), which utilises MAGMA 25 software Cell Genomics 3, 100261, March 8, 2023 e7 Resource ll OPEN ACCESS package as a backend, to identify cell types positively associated with the common-variant genetic hits from the most recent PD GWAS. 2 LD regions were calculated using the European panel of 1000 Genomes Project Phase 3. 73 The cell type enrichment analysis was performed on 5000 subsampled cells from each scRNA-seq cluster.
Single-cell expression quantitative trait loci analysis Variants from the whole-genome sequencing data were correlated with normalised average gene expression levels per cell cluster by performing single-cell expression quantitative trait loci analysis. After quality control, 77 samples were included for analysis and expression data were filtered for 0.025 average expression in all samples. Then genes were removed with zero expression in 15 or more samples resulting in expression of 1256 genes across 90 risk variants. eQTL analysis was performed using MatrixEQTL 57 including variants with minor allele frequency >5% and using the following covariates: batch, sex, age of donor, GBA1, SNCA, LRRK2, phenotype, TH + levels, MAP2+ levels, number of cells, reads per cell, total genes detected, and median UMI counts per cell. Overlap between eQTL variants and GWAS was determined using the most recent PD GWAS. 2 For GWAS loci of interest, violin plots were generated to visualise the correlation between genotype and gene expression. Additionally, LocusZoom 58 and LocusCompare 59 plots were generated to visualise correlations between GWAS signal and eQTL signal and the PD GWAS locus browser was used for loci numbering and prioritisation. 26 The analysis pipeline can be found here: https://github.com/ FOUNDINPD/SCRN_EQTL_v2. Bulk eQTL analysis was performed separately on day 0, 25, and 65 data using tensorQTL 60 and included estimated cell fractions as covariates. The estimated cell fractions were generated using the Scaden 61 deconvolution tool trained on the day 65 single-cell data.

FOUNDIN-PD browser
Architecturally, the FOUNDIN-PD portal is a single-page application (SPA) framework where a public javascript application interacts with a secured JSON API to build the user DOM within the user browser. The client-side nature of the application allows for dynamic interactions with the user with low latency and high scalability, leveraging the fact that many users will leverage modern computing and browsing capabilities. At a granular level, the FOUNDIN-PD application is based on JavaScript ECMAScript 2016 and builds upon Vega.js (vega.github.io; version 5.22) visualisation grammar and D3.js (https://d3js.org; version 6) for dynamic responsive graphing. The API is within a sharded MongoDB 4.2 (https://mongodb.com) framework on a CentOS8 cloud server using an NGINX (https://nginx.org; 1.18) proxy, NodeJS 12 middleware (https://nodejs.org), to provide a protected JSON API. API data is secured using JSON/JWT authentication via Auth0 (https://auth0.com) and Google OAUTH 2.0 (https://oauth.net/2/) for the identification of users.

QUANTIFICATION AND STATISTICAL ANALYSIS
All analysis details pipelines for data in these studies can be found in the STAR Methods sections and on the FOUNDIN-PD GitHub (https://github.com/FOUNDINPD). Additionally detailed SOPs for each assay can be found at the PPMI portal (https://www. ppmi-info.org/).